
    Using nondeterministic learners to alert on coffee rust disease

    Motivated by an agricultural case study, we discuss how to learn functions able to predict whether the value of a continuous target variable will be greater than a given threshold. In the application studied, the aim was to raise alerts on high incidences of coffee rust, the main coffee crop disease in the world. The objective is to use chemical prevention of the disease only when necessary, in order to obtain healthier products and reductions in costs and environmental impact. In this context, the costs of misclassifications are not symmetrical: false negative predictions may lead to the loss of coffee crops. The baseline approach to this problem is to learn a regressor from the variables that record the factors affecting the appearance and growth of the disease. However, the number of errors is too high to obtain a reliable alarm system. The approaches explored here try to learn hypotheses whose predictions are allowed to be intervals rather than single points. Thus, in addition to alarms and non-alarms, these predictors identify situations with uncertain classification, which we call warnings. We present three different implementations: one based on regression and two based on classifiers. These methods are compared using a framework where the costs of false negatives are higher than those of false positives, and both are higher than the cost of warning predictions.
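    A minimal sketch of the interval-based alarm scheme described above, assuming a simple least-squares regressor whose prediction interval is a fixed multiple of the residual standard deviation. The function names, the interval rule, and the parameter `k` are illustrative choices, not the paper's actual implementation:

    ```python
    import numpy as np

    def fit_linear(X, y):
        """Ordinary least squares with an intercept; returns weights and residual std."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        resid = y - Xb @ w
        return w, resid.std()

    def predict_with_warning(w, sigma, x, threshold, k=1.0):
        """Return 'alarm', 'no-alarm', or 'warning' when the interval
        [y_hat - k*sigma, y_hat + k*sigma] straddles the threshold."""
        y_hat = np.append(x, 1.0) @ w
        lo, hi = y_hat - k * sigma, y_hat + k * sigma
        if lo > threshold:
            return "alarm"
        if hi < threshold:
            return "no-alarm"
        return "warning"
    ```

    Widening `k` trades more warnings for fewer costly false negatives, which matches the asymmetric cost framework the abstract describes.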

    A simple and efficient method for variable ranking according to their usefulness for learning

    The selection of a subset of input variables is often based on the prior construction of a ranking that orders the variables according to a given criterion of relevancy. The objective is then to linearize the search, estimating the quality of subsets containing the topmost-ranked variables. An algorithm devised to rank input variables according to their usefulness in the context of a learning task is presented. This algorithm is the result of a combination of simple and classical techniques, like correlation and orthogonalization, which allow the construction of a fast algorithm that also deals explicitly with redundancy. Additionally, the proposed ranker is endowed with a simple polynomial expansion of the input variables to cope with nonlinear problems. The comparison with some state-of-the-art rankers showed that this combination of simple components is able to yield high-quality rankings of input variables. The experimental validation is made on a wide range of artificial data sets, and the quality of the rankings is assessed using a ROC-inspired setting to avoid biased estimations due to any particular learning algorithm.
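    A rough sketch of the correlation-plus-orthogonalization idea, assuming a greedy loop that picks the feature most correlated with the residual target and then projects it out of the remaining features (the exact formula and stopping rule of the paper's ranker may differ):

    ```python
    import numpy as np

    def rank_variables(X, y):
        """Greedy ranking: repeatedly pick the feature most correlated with the
        (residual) target, then project it out of the remaining features and the
        target, so redundant copies of an already-chosen feature score low."""
        X = X - X.mean(axis=0)
        y = y - y.mean()
        remaining = list(range(X.shape[1]))
        ranking = []
        Xw, yw = X.copy(), y.copy()
        while remaining:
            corrs = []
            for j in remaining:
                xj = Xw[:, j]
                denom = np.linalg.norm(xj) * np.linalg.norm(yw)
                corrs.append(abs(xj @ yw) / denom if denom > 1e-12 else 0.0)
            best = remaining[int(np.argmax(corrs))]
            ranking.append(best)
            remaining.remove(best)
            v = Xw[:, best]
            nv = v @ v
            if nv > 1e-12:
                u = v / np.sqrt(nv)  # Gram-Schmidt step against the chosen feature
                yw = yw - (u @ yw) * u
                for j in remaining:
                    Xw[:, j] = Xw[:, j] - (u @ Xw[:, j]) * u
        return ranking
    ```

    The orthogonalization step is what handles redundancy explicitly: an exact duplicate of a selected feature is annihilated by the projection and drops to the bottom of the ranking.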

    Optimizing different loss functions in multilabel classifications

    Multilabel classification (ML) aims to assign a set of labels to an instance. This generalization of multiclass classification leads to a redefinition of the loss functions, and the learning tasks become harder. The objective of this paper is to gain insight into the relations between optimization aims and some of the most popular performance measures: subset (or 0/1) loss, Hamming loss, and the example-based F-measure. To make a fair comparison, we implemented three ML learners that explicitly optimize each of these measures in a common framework. This can be done by considering a subset of labels as a structured output. Then, we use structured output support vector machines tailored to optimize a given loss function. The paper includes an exhaustive experimental comparison. The conclusion is that in most cases the optimization of the Hamming loss produces the best or competitive scores. This is a practical result, since the Hamming loss can be minimized using a set of binary classifiers, one for each label separately, and it is therefore a scalable and fast way to learn ML tasks. Additionally, we observe that in noise-free learning tasks optimizing the subset loss is the best option, but the differences are very small. We have also noticed that the biggest room for improvement is found when the goal is to optimize an F-measure in noisy learning tasks.
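    The scalability argument for the Hamming loss can be sketched with plain binary relevance: one independent classifier per label minimizes each label's error separately, which is exactly what the Hamming loss measures. This toy version uses logistic regression trained by gradient descent rather than the structured SVMs of the paper:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_binary_relevance(X, Y, lr=0.5, epochs=500):
        """Train one logistic model per label column of Y (binary relevance)."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        W = np.zeros((Y.shape[1], Xb.shape[1]))
        for k in range(Y.shape[1]):
            for _ in range(epochs):
                grad = Xb.T @ (sigmoid(Xb @ W[k]) - Y[:, k]) / Xb.shape[0]
                W[k] -= lr * grad
        return W

    def predict(W, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return (sigmoid(Xb @ W.T) >= 0.5).astype(int)

    def hamming_loss(Y_true, Y_pred):
        """Fraction of label slots predicted wrongly, averaged over examples."""
        return float(np.mean(Y_true != Y_pred))
    ```

    The subset (0/1) loss, by contrast, only counts an example as correct when every label matches, which is why it needs joint, structured prediction to optimize directly.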

    A heuristic for learning decision trees and pruning them into classification rules

    Let us consider a set of training examples described by continuous or symbolic attributes with categorical classes. In this paper we present a measure of the potential quality of a region of the attribute space to be represented as a rule condition to classify unseen cases. The aim is to take into account the distribution of the classes of the examples. The resulting measure, called the impurity level, is inspired by a similar measure used in the instance-based algorithm IB3 for selecting suitable paradigmatic exemplars that will classify, in a nearest-neighbor context, future cases. The features of the impurity level are illustrated using a version of Quinlan's well-known C4.5 in which the information-based heuristics are replaced by our measure. The experiments carried out to test the proposals indicate that very high accuracy is reached with sets of classification rules as small as those found by RIPPER.
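    One plausible reading of an IB3-inspired impurity measure: penalize regions whose majority-class proportion has a weak statistical lower bound, so small or mixed regions look impure. This sketch uses the Wilson score interval; the paper's actual formula may well differ, and `impurity_level` is an illustrative name:

    ```python
    import math

    def wilson_lower_bound(successes, n, z=1.96):
        """Lower bound of the Wilson score interval for a proportion."""
        if n == 0:
            return 0.0
        p = successes / n
        denom = 1 + z * z / n
        centre = p + z * z / (2 * n)
        margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return (centre - margin) / denom

    def impurity_level(class_counts, z=1.96):
        """Illustrative impurity: 1 minus the Wilson lower bound on the
        majority-class proportion of the region."""
        n = sum(class_counts)
        if n == 0:
            return 1.0
        return 1.0 - wilson_lower_bound(max(class_counts), n, z)
    ```

    Unlike entropy or the Gini index, such a confidence-based measure distinguishes a pure region with 50 examples from a pure region with 3, which matters when a region is to be promoted to a standalone classification rule.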

    Analysis of nutrition data by means of a matrix factorization method

    We present a factorization framework to analyze the data of a regression learning task with two peculiarities. First, the inputs can be split into two parts that represent semantically significant entities. Second, the performance of regressors is very low. The basic idea of the approach presented here is to try to learn the ordering relations of the target variable instead of its exact value. Each part of the input is mapped into a common Euclidean space in such a way that the distance in the common space represents the interaction of both parts of the input. The factorization approach obtains reliable models from which it is possible to compute a ranking of the features according to their responsibility for the variation of the target variable. Additionally, the Euclidean representation of the data provides a visualization in which metric properties have a clear semantics. We illustrate the approach with a case study: the analysis of a dataset about the variations of Body Mass Index for Age of children after a Food Aid Program deployed in poor rural communities in Southern México. In this case, the two parts of the inputs are the vectorial representations of children and their diets. In addition to discovering latent information, the mapping of inputs allows us to visualize children and diets in a common metric space.
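    The core idea (map the two input parts into a shared space and score their interaction by proximity, trained only from ordering relations) can be sketched with two linear maps and a pairwise hinge loss. The maps `A` and `B`, the margin, and the SGD loop are illustrative assumptions, not the paper's exact model:

    ```python
    import numpy as np

    def score(A, B, u, v):
        """Interaction score: negative squared distance between the two
        mapped parts in the shared Euclidean space."""
        d = A @ u - B @ v
        return -float(d @ d)

    def fit_pairwise(A, B, pairs, lr=0.05, epochs=100, margin=0.1):
        """pairs: list of ((u_hi, v_hi), (u_lo, v_lo)) meaning the first
        input pair should score higher than the second. Hinge loss + SGD."""
        for _ in range(epochs):
            for (u1, v1), (u2, v2) in pairs:
                s1, s2 = score(A, B, u1, v1), score(A, B, u2, v2)
                if s1 < s2 + margin:  # ordering violated: take a gradient step
                    d1 = A @ u1 - B @ v1
                    d2 = A @ u2 - B @ v2
                    A -= lr * (2 * np.outer(d1, u1) - 2 * np.outer(d2, u2))
                    B -= lr * (-2 * np.outer(d1, v1) + 2 * np.outer(d2, v2))
        return A, B
    ```

    Because only orderings are fitted, a weak regressor's noisy target values matter less; and since both entities live in one Euclidean space, plotting the learned embeddings gives the kind of joint visualization the abstract describes.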

    Learning to assess from pair-wise comparisons

    In this paper we present an algorithm for learning a function able to assess objects. We assume that our teachers can provide a collection of pairwise comparisons but encounter certain difficulties in assigning a number to the qualities of the objects considered. This is a typical situation when dealing with food products, where it is very interesting to have repeatable, reliable mechanisms that are as objective as possible to evaluate quality, in order to provide markets with products of a uniform quality. The same problem arises when we are trying to learn user preferences in an information retrieval system or when configuring a complex device. The algorithm is implemented using a growing variant of Kohonen’s Self-Organizing Maps (growing neural gas), and is tested with a variety of data sets to demonstrate the capabilities of our approach.
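    The learning problem itself (recover a numeric assessment function from "a is preferred to b" judgments) can be illustrated with a much simpler linear stand-in for the growing-neural-gas model the paper actually uses. This perceptron-style sketch is a deliberate simplification, not the paper's method:

    ```python
    import numpy as np

    def fit_utility(pairs, dim, lr=0.1, epochs=50):
        """Learn a linear utility w from pairwise comparisons (a, b),
        each meaning 'a is preferred to b', by perceptron-style updates."""
        w = np.zeros(dim)
        for _ in range(epochs):
            for a, b in pairs:
                if w @ a <= w @ b:       # preference violated (or tied)
                    w += lr * (a - b)    # push w toward satisfying it
        return w
    ```

    The learned `w` assigns a repeatable score to any new object, even though no teacher ever produced a number, only comparisons.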

    Using machine learning procedures to ascertain the influence of beef carcass profiles on carcass conformation scores

    In this study, a total of 163 young-bull carcasses belonging to seven Spanish native beef cattle breeds showing substantial carcass variation were photographed in order to obtain digital assessments of carcass dimensions and profiles. This dataset was then analysed using machine learning (ML) methodologies to ascertain the influence of carcass profiles on the grade obtained under the SEUROP system. To achieve this goal, carcasses were obtained under the same standard feeding regime and classified under homogeneous conditions in order to avoid non-linear behaviour in grading performance. Carcass weight affects grading to a large extent, and the classification error obtained when this attribute was included in the training sets was consistently lower than when it was not. However, carcass profile information was considered non-relevant by the ML algorithm in the earlier stages of the analysis. Furthermore, when carcass weight was taken into account, the ML algorithm used only easy-to-measure attributes to clone the classifier's decisions. Here we confirm the possibility of designing a more objective and easy-to-interpret system to classify the most common types of carcass in the territory of the EU using only a few single attributes that are easily obtained in an industrial environment.
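    The with/without-attribute comparison behind the carcass-weight finding can be sketched generically: train the same classifier on the full data and on the data with one column removed, and take the error increase as a crude proxy for that attribute's influence. The 1-nearest-neighbour classifier and the function names here are illustrative, not the study's actual ML algorithm:

    ```python
    import numpy as np

    def nn_error(X, y):
        """Leave-one-out 1-nearest-neighbour error rate."""
        n = len(y)
        errs = 0
        for i in range(n):
            d = np.sum((X - X[i]) ** 2, axis=1)
            d[i] = np.inf              # exclude the point itself
            errs += y[np.argmin(d)] != y[i]
        return errs / n

    def attribute_influence(X, y, col):
        """Error increase when column `col` is dropped: a rough proxy for
        the attribute's influence on classification."""
        full = nn_error(X, y)
        reduced = nn_error(np.delete(X, col, axis=1), y)
        return reduced - full
    ```

    An attribute like carcass weight, whose removal consistently raises the error, would score high on this proxy; an irrelevant profile measurement would score near zero.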